Teraflops Supercomputer: Architecture and Validation of the Fault Tolerance Mechanisms

Author

  • Cristian Constantinescu
Abstract

Intel Corporation developed the Teraflops supercomputer for the US Department of Energy (DOE) as part of the Accelerated Strategic Computing Initiative (ASCI). This is the most powerful computing machine available today, performing over two trillion floating point operations per second with the aid of more than 9,000 Intel processors. The Teraflops machine employs complex hardware and software fault/error handling mechanisms for complying with DOE's reliability requirements. This paper gives a brief description of the system architecture and presents the validation of the fault tolerance mechanisms. Physical fault injection at the IC pin level was used for validation purposes. An original approach was developed for assessing signal sensitivity to transient faults and the effectiveness of the fault/error handling mechanisms. Dependency between fault/error detection coverage and fault duration was also determined. Fault injection experiments unveiled several malfunctions at the hardware, firmware, and software levels. The supercomputer performed according to the DOE requirements after corrective actions were implemented. The fault injection approach presented in this paper can be used for validation of any fault-tolerant or highly available computing system.

Index Terms: Supercomputing, fault-tolerant computing, validation, fault injection, fault/error detection coverage.
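
The dependency between detection coverage and fault duration mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method. It assumes the common definition of detection coverage as the fraction of injected faults that the system detects, and it assumes each injection experiment is logged as a `(fault_duration, detected)` pair. The function name `coverage_by_duration`, the record layout, and the sample records are all hypothetical.

```python
from collections import defaultdict


def coverage_by_duration(records):
    """Estimate detection coverage for each injected fault duration.

    `records` is an iterable of (fault_duration, detected) pairs, one per
    injection experiment. This layout is an assumption for illustration,
    not a format taken from the paper.
    """
    injected = defaultdict(int)   # number of faults injected per duration
    detected = defaultdict(int)   # number of those faults that were detected
    for duration, was_detected in records:
        injected[duration] += 1
        if was_detected:
            detected[duration] += 1
    # Coverage estimate = detected / injected, reported in order of duration.
    return {d: detected[d] / injected[d] for d in sorted(injected)}


if __name__ == "__main__":
    # Made-up records (duration in arbitrary time units), for illustration only.
    sample = [(25, False), (25, True), (100, True), (100, True)]
    for duration, coverage in coverage_by_duration(sample).items():
        print(f"duration={duration}: coverage={coverage:.2f}")
```

A real study would also report a confidence interval for each estimate, since the number of injections per duration is finite.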

Similar resources

Multi-Layer Fault Tolerance for Distributed Real-Time Systems

This thesis addresses issues in building fault-tolerant distributed real-time systems. Such systems are increasingly deployed in automotive and avionics applications. We focus on the design and validation of fault tolerance mechanisms. From the design viewpoint, we develop the notion of multi-layer fault tolerance. A fault-tolerant distributed system contains a set of mechanisms that provide er...

The 1 Teraflops QCDSP computer

The QCDSP computer (Quantum Chromodynamics on Digital Signal Processors) is an inexpensive, massively parallel computer intended primarily for simulations in lattice gauge theory. Currently, two large QCDSP machines are in full-time use: an 8,192-processor, 0.4 Teraflops machine at Columbia University and a 12,288-processor, 0.6 Teraflops machine at the RIKEN-BNL Research Center at Brookhaven ...

Replica Management in Real-Time Ada 95 Application

In this paper, we present some of the fault tolerance management mechanisms being implemented in the Multi-μ architecture, namely its support for replica non-determinism. In this architecture, fault tolerance is achieved by node active replication, with software based replica management and fault tolerance transparent algorithms. A software layer implemented between the application and the real...

Understanding Communication Faults in Parallel Computers

This paper addresses the evaluation of the dependability properties of distributed memory parallel systems through fault injection. The most popular parallel computers are based on the distributed memory architecture where loosely coupled processors communicate by message-passing. Fault tolerance is an issue which increasingly concerns manufacturers and end users of these systems as the probabi...

A generalized ABFT technique using a fault tolerant neural network

In this paper we first show that the standard BP algorithm cannot yield a uniform information distribution over the neural network architecture. A measure of sensitivity is defined to evaluate the fault tolerance of a neural network, and then we show that the sensitivity of a link is closely related to the amount of information that passes through it. Based on this assumption, we prove that the distribu...

Journal title:
  • IEEE Trans. Computers

Volume 49  Issue 

Pages  -

Publication date 2000